Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University,
2 Beijing Academy of Artificial Intelligence, 3 Institute for Brain and Intelligence, Fudan University,
4 University of Science and Technology Beijing, 5 Beijing Innovation Center of Humanoid Robotics

*Equal contribution, Corresponding author

Overview of RoboBench. RoboBench evaluates MLLMs as embodied brains across 5 dimensions, 14 subdimensions, and 25 tasks, with tasks color-coded by type (top left). These dimensions follow the embodied execution pipeline (bottom)—from understanding intent and perceiving the environment to planning and adapting actions, refining subgoals via affordances, and diagnosing failures—capturing the core cognitive roles of System 2. The performance comparison (top right) highlights significant gaps among state-of-the-art MLLMs, with Gemini-2.5-Pro achieving the best results.

Abstract

Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential for advancing robotic intelligence. Yet existing benchmarks either emphasize execution success or, when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the embodied brain's critical roles across the full manipulation pipeline, RoboBench defines five dimensions—instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis—spanning 14 capabilities, 25 tasks, and 6,092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, multi-view scenes, and memory-driven navigation, drawing from large-scale real robotic data and in-house collection. For planning, RoboBench introduces an evaluation framework that uses an MLLM as a world simulator. It moves beyond symbolic matching to evaluate embodied feasibility by simulating whether predicted plans can achieve critical object-state changes under physical and visual constraints, enabling faithful assessment of long-horizon reasoning. Experiments on 14 state-of-the-art MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition, clarify the role of the embodied brain, and guide the development of next-generation MLLMs toward more robust robotic intelligence.

News

🔥 2025.10.23 - Dataset released on Hugging Face! Check it out at https://huggingface.co/datasets/LeoFan01/RoboBench ❗️
🔥 2025.10.21 - The paper has been released! Code and dataset are being organized and will be released soon. Stay tuned❗️
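
For quick inspection, the released data can be pulled with the Hugging Face `datasets` library. The snippet below is a minimal sketch: the repo id comes from the link above, but the split name and field layout are assumptions, so check the dataset card for the actual configuration.

```python
# Minimal sketch: load RoboBench from the Hugging Face Hub.
# The split name "test" and the record layout are assumptions --
# consult the dataset card for the real configuration.
from datasets import load_dataset

robobench = load_dataset("LeoFan01/RoboBench", split="test")
print(robobench)       # number of rows and column names
print(robobench[0])    # inspect a single QA item
```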

Highlight

🔍 Benchmark Overview
  • The first comprehensive benchmark focused on evaluating MLLMs as embodied brains.
  • Systematic evaluation across 5 core dimensions, 14 capabilities, 25 task types, and 6,092 high-quality questions.
🧭 Comprehensive Dimensions
  • Covers key embodied skills tailored to MLLM capabilities, including instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis in real-world settings.
🛡️ Robust Evaluation
  • All questions are manually verified for quality and consistency.
  • Long-horizon task planning is evaluated using a novel Directed Acyclic Graph (DAG)-guided approach to ensure rigor and robustness.
🧠 Real-world Data
  • Built on the latest open-source real-robot datasets and proprietary real-world data.
  • Evaluation tasks are grounded in realistic embodied interaction scenarios.
🌍 Diverse Composition
  • Sourced from a wide range of data and scenarios.
  • Captures the complexity and diversity of real-world embodied tasks.

Leaderboard

Perception Reasoning. Sub-dimensions are grouped as Robotic-centric (Robot-type, Robot-view), Object-centric (Static Attr., Functional Attr.), Scene-centric (Spatial Relation, Temp. Grounding), and Task-centric (Causality, Refer. Comprehen.).

| Model | Robot-type | Robot-view | Static Attr. | Functional Attr. | Spatial Relation | Temp. Grounding | Causality | Refer. Comprehen. | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | | | |
| Human Evaluation | 80.67 | 79.08 | 43.77 | 83.89 | 70.91 | 51.61 | 91.22 | 93.22 | 74.30 |
| GPT-4o-text-only | 20.51 | 13.77 | 5.18 | 35.37 | 25.74 | 18.32 | 25.52 | 22.09 | 20.81 |
| Closed-Source MLLMs | | | | | | | | | |
| GPT-4o-Mini | 38.75 | 18.84 | 26.43 | 53.66 | 30.36 | 22.65 | 34.25 | 39.67 | 33.08 |
| GPT-4o | 64.96 | 39.38 | 24.92 | 46.75 | 42.24 | 20.61 | 33.10 | 41.31 | 39.16 |
| Claude-3.5-Sonnet | 41.31 | 36.23 | 29.13 | 62.60 | 34.98 | 21.88 | 36.09 | 25.36 | 35.95 |
| Claude-3.7-Sonnet | 40.46 | 32.37 | 45.20 | 71.14 | 36.63 | 21.09 | 40.92 | 28.02 | 39.48 |
| Gemini-2.0-Flash | 56.69 | 20.77 | 49.08 | 78.46 | 42.57 | 21.37 | 51.72 | 72.40 | 49.13 |
| Gemini-2.5-Flash | 62.39 | 39.38 | 55.02 | 77.24 | 57.43 | 33.58 | 70.34 | 74.64 | 58.75 |
| Gemini-2.5-Pro | 64.30 | 41.71 | 54.83 | 82.27 | 60.44 | 49.68 | 71.73 | 78.68 | 62.96 |
| Qwen-VL-Plus | 28.21 | 21.74 | 34.63 | 58.54 | 27.72 | 21.37 | 31.03 | 34.36 | 32.20 |
| Qwen-VL-Max | 47.86 | 43.48 | 39.70 | 75.20 | 50.17 | 27.45 | 37.93 | 41.53 | 45.42 |
| Open-Source Multi-Image MLLMs | | | | | | | | | |
| LLaVA-OneVision-0.5B | 30.34 | 23.68 | 37.08 | 49.66 | 27.27 | 18.42 | 23.65 | 19.21 | 28.66 |
| LLaVA-OneVision-7B | 44.83 | 30.26 | 33.43 | 75.84 | 45.45 | 23.68 | 25.68 | 44.63 | 40.48 |
| Qwen2.5-VL-7B-Ins | 23.93 | 26.81 | 37.86 | 46.34 | 31.68 | 22.90 | 14.48 | 36.81 | 30.10 |
| Qwen2.5-VL-72B-Ins | 47.72 | 42.75 | 41.74 | 72.95 | 48.51 | 27.87 | 40.32 | 42.13 | 45.50 |
| Embodied MLLMs | | | | | | | | | |
| RoboBrain-2.0-7B | 44.97 | 24.84 | 40.43 | 79.19 | 48.18 | 23.48 | 41.22 | 53.67 | 44.50 |

Instruction Comprehension and Generalized Planning. Planning columns are grouped as Cross-Embodiment (Single-arm, Dual-arm, Mobile-manip., Human), Cross-Object (Material Afford., Physical Attr., World Knowl.), Cross-View (Multi-view, Single-view), and Cross-Task (Navigation Plan.).

| Model | Explicit | Implicit | IC Avg | Single-arm | Dual-arm | Mobile-manip. | Human | Material Afford. | Physical Attr. | World Knowl. | Multi-view | Single-view | Navigation Plan. | Plan. Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | | | | | | | | |
| Human Evaluation | 59.94 | 61.13 | 60.54 | 72.50 | 41.93 | 41.55 | 62.28 | 56.70 | 58.98 | 49.36 | 52.82 | 51.59 | 45.23 | 54.50 |
| GPT-4o-text-only | 38.80 | 11.10 | 24.95 | 26.70 | 33.32 | 43.65 | 37.86 | 36.58 | 22.33 | 37.68 | 44.35 | 38.11 | 36.90 | 33.95 |
| Closed-Source MLLMs | | | | | | | | | | | | | | |
| GPT-4o-Mini | 41.21 | 14.95 | 28.08 | 27.47 | 25.21 | 37.98 | 31.72 | 33.75 | 38.46 | 42.56 | 39.11 | 33.29 | 34.04 | 33.31 |
| GPT-4o | 45.60 | 19.04 | 32.32 | 28.28 | 32.65 | 52.69 | 35.71 | 39.93 | 46.09 | 41.34 | 38.51 | 33.66 | 39.41 | 37.74 |
| Claude-3.5-Sonnet | 42.11 | 14.85 | 28.48 | 30.18 | 33.65 | 50.29 | 41.05 | 38.28 | 40.67 | 39.63 | 45.95 | 40.43 | 39.77 | 38.07 |
| Claude-3.7-Sonnet | 47.77 | 14.53 | 31.15 | 29.86 | 38.69 | 50.39 | 37.06 | 38.65 | 41.86 | 51.83 | 48.19 | 44.51 | 39.95 | 41.68 |
| Gemini-2.0-Flash | 43.49 | 16.38 | 29.93 | 28.67 | 33.66 | 48.27 | 33.95 | 40.76 | 54.27 | 40.12 | 46.13 | 40.73 | 37.02 | 38.62 |
| Gemini-2.5-Flash | 42.53 | 17.10 | 29.82 | 27.05 | 40.46 | 49.91 | 34.50 | 39.87 | 53.37 | 46.22 | 39.41 | 43.29 | 38.32 | 39.33 |
| Gemini-2.5-Pro | 51.15 | 19.60 | 35.37 | 29.71 | 37.65 | 50.96 | 37.44 | 39.29 | 56.50 | 43.29 | 47.35 | 45.12 | 43.62 | 41.81 |
| Qwen-VL-Plus | 37.77 | 10.38 | 24.07 | 24.68 | 21.75 | 32.98 | 33.91 | 28.45 | 33.55 | 33.78 | 30.95 | 28.60 | 4.39 | 26.77 |
| Qwen-VL-Max | 46.45 | 16.98 | 31.71 | 28.30 | 35.73 | 47.79 | 32.40 | 40.44 | 44.33 | 42.32 | 41.79 | 37.68 | – | 38.00 |
| Open-Source Multi-Image MLLMs | | | | | | | | | | | | | | |
| LLaVA-OneVision-0.5B | 6.82 | 1.24 | 3.61 | 2.90 | 4.57 | 4.77 | 3.68 | 4.77 | 3.47 | 6.47 | 4.30 | 3.62 | 11.39 | 4.83 |
| LLaVA-OneVision-7B | 18.93 | 3.48 | 10.05 | 11.48 | 16.23 | 8.27 | 5.34 | 18.51 | 15.62 | 8.10 | 0.00 | 15.16 | 24.67 | 12.15 |
| Qwen2.5-VL-7B-Ins | 26.45 | 4.65 | 15.55 | 19.47 | 12.90 | 28.75 | 28.19 | 22.06 | 21.63 | 25.61 | 11.79 | 20.12 | 2.10 | 18.64 |
| Qwen2.5-VL-72B-Ins | 46.81 | 15.15 | 30.98 | 28.20 | 36.92 | 49.14 | 31.31 | 40.51 | 44.94 | 38.90 | 43.16 | 40.24 | 37.47 | 37.73 |
| Embodied MLLMs | | | | | | | | | | | | | | |
| RoboBrain-2.0-7B | 36.93 | 8.19 | 22.51 | 15.46 | 25.32 | 32.72 | 31.81 | 19.85 | 30.85 | 23.24 | 31.51 | 23.89 | 24.53 | 25.35 |

Affordance Prediction and Failure Analysis.

| Model | Static | Dynamic | Navigation | Afford. Avg | Execution | Planning | Fail. Avg |
|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | |
| Human Evaluation | 86.08 | 80.02 | 81.85 | 82.63 | 47.30 | 80.67 | 63.99 |
| GPT-4o-text-only | 44.89 | 40.70 | 38.19 | 39.88 | 25.17 | 37.93 | 31.55 |
| Closed-Source MLLMs | | | | | | | |
| GPT-4o-Mini | 50.64 | 42.88 | 42.30 | 46.39 | 17.66 | 44.60 | 31.13 |
| GPT-4o | 55.61 | 49.14 | 49.91 | 51.91 | 22.29 | 57.01 | 39.65 |
| Claude-3.5-Sonnet | 56.26 | 54.25 | 53.84 | 54.77 | 16.12 | 47.52 | 31.82 |
| Claude-3.7-Sonnet | 60.02 | 52.38 | 50.07 | 54.06 | 18.32 | 54.24 | 36.28 |
| Gemini-2.0-Flash | 61.65 | 61.76 | 66.89 | 63.37 | 28.48 | 59.80 | 44.14 |
| Gemini-2.5-Flash | 61.20 | 52.04 | 52.01 | 54.29 | 18.54 | 67.65 | 43.10 |
| Gemini-2.5-Pro | 70.54 | 62.03 | 63.96 | 65.21 | 15.96 | 74.31 | 45.14 |
| Qwen-VL-Plus | 51.74 | 37.42 | 47.97 | 48.18 | 13.91 | 40.00 | 26.96 |
| Qwen-VL-Max | 70.01 | 56.26 | 50.85 | 59.43 | 17.22 | 57.93 | 37.58 |
| Open-Source Multi-Image MLLMs | | | | | | | |
| LLaVA-OneVision-0.5B | 20.56 | 28.56 | 27.69 | 24.76 | 21.19 | 24.67 | 22.93 |
| LLaVA-OneVision-7B | 23.83 | 33.61 | 33.43 | 30.29 | 29.14 | 34.00 | 31.56 |
| Qwen2.5-VL-7B-Ins | 49.73 | 38.03 | 42.16 | 43.15 | 13.91 | 26.90 | 20.41 |
| Qwen2.5-VL-72B-Ins | 71.54 | 51.94 | 47.67 | 56.67 | 12.59 | 50.72 | 31.66 |
| Embodied MLLMs | | | | | | | |
| RoboBrain-2.0-7B | 51.87 | 54.63 | 41.61 | 49.37 | 7.95 | 42.00 | 41.24 |

Key Findings from RoboBench Evaluation

Overall Findings

🥇 Gemini-2.5-Pro Leads but Still Trails Humans

Gemini-2.5-Pro achieves the strongest overall performance across all five cognitive dimensions. It scores 62.96 in perception reasoning, 65.21 in affordance prediction, and 45.14 in failure analysis, well above other models but still far below the human reference (74.30 / 82.63 / 63.99). This underscores a persistent gap between current MLLMs and robust human-level embodied intelligence.

🔒 Closed-Source Models Still Hold the Advantage

Closed-source MLLMs outperform open-source ones in four of the five dimensions, often by 10–15%. Open-source models approach parity only in perception reasoning. Within each family, larger or newer models consistently perform better, e.g., GPT-4o > GPT-4o-Mini and Claude-3.7 > Claude-3.5.

🤖 Embodied Training Brings Noticeable Gains

The embodied MLLM RoboBrain-2.0-7B surpasses similarly sized general open-source models in perception reasoning, planning, and affordance prediction. This validates the effectiveness of domain-specific embodied datasets for improving multimodal reasoning and planning.

📊 Cognitive Difficulty Varies Across Dimensions

Perception reasoning yields the highest accuracies, while generalized planning remains the most challenging, exposing weaknesses in long-horizon reasoning and structured task decomposition. This contrast highlights where future progress is most needed.

Fine-grained Findings

🧠 Implicit Intent Understanding Remains a Major Challenge

Performance on implicit instructions drops by roughly 30 points compared to explicit ones. Models struggle to infer goals from indirect human demands, revealing weak integration of language, perception, and context.

👁️ Perception and Temporal Reasoning Bottlenecks

Models misidentify robot types and viewpoints and fail to localize events in time. Temporal and causal reasoning accuracies hover around 30–40%, except for the Gemini series. Stronger embodiment-aware perception and spatiotemporal reasoning modules are needed.

🧩 Planning Limitations Persist

  • Cross-embodiment: poor coordination in dual-arm and mobile manipulation.
  • Cross-object: difficulty with rare or knowledge-dependent objects.
  • Cross-view: multi-image inputs markedly improve performance (e.g., +5–7 points for GPT-4o / Claude-3.7), showing the promise of multi-view reasoning.

⚙️ Failure Analysis Is Extremely Hard

Diagnosing execution-level errors is far more difficult than planning-level ones (scores of roughly 10–20 vs. 40–60). It requires fine-grained spatial and physical understanding, e.g., distinguishing location errors from rotation errors. Even humans achieve only 47.3 on such tasks, underscoring their intrinsic complexity.

Dataset Construction Pipeline

Dataset Construction Pipeline. RoboBench integrates open-source and self-collected robot data under a shared process—preprocessing → tool-assisted and human-in-the-loop annotation → unified schema → auto-generated QA—and builds datasets for five dimensions:
  • Instruction Comprehension: pair explicit instructions with LLM-rewritten implicit variants to test intent understanding.
  • Perception Reasoning: use captioning/detection/segmentation tools to draft labels across robotic/object/scene/task views, then human-refine and standardize them.
  • Generalized Planning: construct a planning pool from robot videos; VLMs produce step/timestamp summaries and metadata, which are mapped to function templates to support the Q1/Q2/Q3 evaluations.
  • Affordance Prediction: sample key frames and annotate static (contact points), dynamic (trajectories), and mobile (base positions) affordances.
  • Failure Analysis: mine execution-level failures from real trials and synthesize planning-level errors by perturbing correct instructions.
All outputs follow one schema and are rendered into binary, single-choice, and multi-step multiple-choice QA formats for open- and closed-source MLLMs.
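
To make the "one schema, three QA formats" idea concrete, the sketch below shows what a unified record could look like. All field names are hypothetical illustrations, not the benchmark's actual schema.

```python
# Hypothetical sketch of a unified QA record; field names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RoboBenchQA:
    dimension: str            # e.g., "generalized_planning" (one of the 5 dimensions)
    capability: str           # e.g., "cross_embodiment" (one of the 14 capabilities)
    task: str                 # one of the 25 task types
    question_type: str        # "binary" | "single_choice" | "multi_step_mcq"
    images: List[str]         # key frames or multi-view observations
    question: str             # rendered question text
    choices: List[str]        # answer options (e.g., ["yes", "no"] for binary items)
    answer: str               # gold label, e.g., "B" or an ordered step string
    meta: dict = field(default_factory=dict)   # embodiment, scene, source dataset, ...
```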

Planning Evaluation Pipeline

Planning Evaluation Framework. Evaluation of the planning dimension (Q1–Q3). Each task is decomposed into a sequence of parameterized atomic actions forming a Directed Acyclic Graph (DAG) that encodes causal and temporal dependencies.
  • Q1 (Long-horizon planning): an MLLM-based world simulator assesses both NodeCorrectness (action alignment) and TaskCompletion (goal-state achievement) by simulating action rollouts under visual and physical constraints.
  • Q2 (Next-step planning): evaluates fine-grained step prediction by comparing skill, object, and parameter accuracy.
  • Q3 (Task state estimation): measures binary correctness on whether a subtask has been completed.
Together, the pipeline provides a unified, interpretable framework for assessing structural correctness and embodied feasibility in planning.
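
As a rough illustration of how DAG-guided scoring could be wired up, the sketch below computes a NodeCorrectness-style score by matching predicted atomic actions against reference nodes whose dependencies are already satisfied, and delegates TaskCompletion to an MLLM world-simulator callable. The data structures, the matching rule, and the `mllm_simulates_goal` hook are simplified assumptions, not the paper's implementation.

```python
# Simplified sketch of DAG-guided scoring for Q1 (long-horizon planning).
# Data structures, the matching rule, and `mllm_simulates_goal` are
# illustrative assumptions, not RoboBench's actual implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

@dataclass(frozen=True)
class AtomicAction:
    skill: str                      # e.g., "pick"
    obj: str                        # e.g., "red mug"

@dataclass
class PlanDAG:
    nodes: Dict[str, AtomicAction]  # node id -> reference atomic action
    parents: Dict[str, Set[str]]    # node id -> ids it causally/temporally depends on

def node_correctness(pred: List[AtomicAction], ref: PlanDAG) -> float:
    """Fraction of reference nodes matched by a predicted action, counting a
    match only if all of the node's dependencies were matched earlier."""
    done: Set[str] = set()
    for act in pred:
        for nid, node in ref.nodes.items():
            if nid in done:
                continue
            if node == act and ref.parents.get(nid, set()) <= done:
                done.add(nid)
                break
    return len(done) / max(len(ref.nodes), 1)

def task_completion(pred: List[AtomicAction], goal_state: str,
                    mllm_simulates_goal: Callable[[List[AtomicAction], str], bool]) -> float:
    """Ask an MLLM world simulator whether rolling out the predicted plan
    reaches the critical object-state changes described by `goal_state`."""
    return 1.0 if mllm_simulates_goal(pred, goal_state) else 0.0

# Toy usage with hypothetical actions and a stubbed simulator:
ref = PlanDAG(
    nodes={"n1": AtomicAction("pick", "cup"), "n2": AtomicAction("place", "cup")},
    parents={"n1": set(), "n2": {"n1"}},
)
pred = [AtomicAction("pick", "cup"), AtomicAction("place", "cup")]
print(node_correctness(pred, ref))                                    # -> 1.0
print(task_completion(pred, "cup on the shelf", lambda p, g: True))   # -> 1.0
```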

Demo Case

BibTeX

@misc{luo2025robobenchcomprehensiveevaluationbenchmark,
  title={Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain},
  author={Yulin Luo and Chun-Kai Fan and Menghang Dong and Jiayu Shi and Mengdi Zhao and Bo-Wen Zhang and Cheng Chi and Jiaming Liu and Gaole Dai and Rongyu Zhang and Ruichuan An and Kun Wu and Zhengping Che and Shaoxuan Xie and Guocai Yao and Zhongxia Zhao and Pengwei Wang and Guang Liu and Zhongyuan Wang and Tiejun Huang and Shanghang Zhang},
  year={2025},
  eprint={2510.17801},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2510.17801},
}